In this project I performed an exploratory analysis of data provided by Ford GoBike, a bike-share system provider, using Python visualization techniques. The goal is to identify which variables have the most influence on how a bike-sharing service is used. The analysis covers trip data from 2019.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import folium
import glob
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
%matplotlib inline
files = glob.glob("data/*.csv")
frames=[]
for file in files:
    frames.append(pd.read_csv(file, index_col=None, header=0))
df = pd.concat(frames, axis=0, ignore_index=True)
# save new dataset
df.to_csv('bike.csv', index=False)
df = pd.read_csv('bike.csv')
df.head()
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | bike_share_for_all_trip | rental_access_method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 80825 | 2019-01-31 17:57:44.6130 | 2019-02-01 16:24:49.8640 | 229.0 | Foothill Blvd at 42nd Ave | 37.775745 | -122.213037 | 196.0 | Grand Ave at Perkins St | 37.808894 | -122.256460 | 4861 | Subscriber | No | NaN |
| 1 | 65900 | 2019-01-31 20:58:33.8860 | 2019-02-01 15:16:54.1730 | 4.0 | Cyril Magnin St at Ellis St | 37.785881 | -122.408915 | 134.0 | Valencia St at 24th St | 37.752428 | -122.420628 | 5506 | Subscriber | No | NaN |
| 2 | 62633 | 2019-01-31 18:06:52.9240 | 2019-02-01 11:30:46.5300 | 245.0 | Downtown Berkeley BART | 37.870139 | -122.268422 | 157.0 | 65th St at Hollis St | 37.846784 | -122.291376 | 2717 | Customer | No | NaN |
| 3 | 44680 | 2019-01-31 19:46:09.7190 | 2019-02-01 08:10:50.3180 | 85.0 | Church St at Duboce Ave | 37.770083 | -122.429156 | 53.0 | Grove St at Divisadero | 37.775946 | -122.437777 | 4557 | Customer | No | NaN |
| 4 | 60709 | 2019-01-31 14:19:01.5410 | 2019-02-01 07:10:51.0650 | 16.0 | Steuart St at Market St | 37.794130 | -122.394430 | 28.0 | The Embarcadero at Bryant St | 37.787168 | -122.388098 | 2100 | Customer | No | NaN |
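pandas can raise a `DtypeWarning` while reading large CSV files when a column mixes types across chunks, which tends to happen with sparsely filled text columns such as `bike_share_for_all_trip` and `rental_access_method`. A minimal sketch of one way to avoid it is to declare those columns as strings up front. The two in-memory CSV snippets and the name `text_cols` are made up for illustration and stand in for the monthly files.

```python
import io

import pandas as pd

# Two tiny in-memory "files" standing in for the monthly CSVs (made-up rows).
csv_a = "bike_id,bike_share_for_all_trip,rental_access_method\n1,No,\n2,Yes,\n"
csv_b = "bike_id,bike_share_for_all_trip,rental_access_method\n3,,app\n4,,clipper\n"

# Assumption: these two sparsely filled columns should be read as text.
text_cols = {"bike_share_for_all_trip": "string", "rental_access_method": "string"}

# Read every file with the same explicit dtypes, then stack them.
frames = [pd.read_csv(io.StringIO(text), dtype=text_cols) for text in (csv_a, csv_b)]
trips = pd.concat(frames, axis=0, ignore_index=True)
print(trips.dtypes)
```

With the real files, the same `dtype` mapping could be passed to `pd.read_csv(file, ...)` inside the loading loop.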
df.sample(5)
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | bike_share_for_all_trip | rental_access_method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2009651 | 542 | 2019-10-22 08:14:57.8340 | 2019-10-22 08:23:59.9140 | 241.0 | Ashby BART Station | 37.852477 | -122.270213 | 248.0 | Telegraph Ave at Ashby Ave | 37.855956 | -122.259795 | 849 | Customer | No | NaN |
| 1264604 | 427 | 2019-07-28 20:13:12.2270 | 2019-07-28 20:20:19.9980 | 223.0 | 16th St Mission BART Station 2 | 37.764765 | -122.420091 | 336.0 | Potrero Ave and Mariposa St | 37.763281 | -122.407377 | 66 | Subscriber | No | NaN |
| 35143 | 1077 | 2019-01-28 09:04:47.5880 | 2019-01-28 09:22:44.9370 | 104.0 | 4th St at 16th St | 37.767045 | -122.390833 | 15.0 | San Francisco Ferry Building (Harry Bridges Pl... | 37.795392 | -122.394203 | 5477 | Subscriber | No | NaN |
| 745541 | 747 | 2019-04-13 10:13:04.3410 | 2019-04-13 10:25:32.0900 | 55.0 | Webster St at Grove St | 37.777053 | -122.429558 | 54.0 | Alamo Square (Steiner St at Fulton St) | 37.777547 | -122.433274 | 4960 | Customer | No | NaN |
| 959095 | 612 | 2019-05-16 10:15:08.9150 | 2019-05-16 10:25:21.3590 | 218.0 | DeFremery Park | 37.812331 | -122.285171 | 235.0 | Union St at 10th St | 37.807239 | -122.289370 | 3418 | Customer | No | NaN |
# Info about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2506983 entries, 0 to 2506982
Data columns (total 15 columns):
| # | Column | Dtype |
|---|---|---|
| 0 | duration_sec | int64 |
| 1 | start_time | object |
| 2 | end_time | object |
| 3 | start_station_id | float64 |
| 4 | start_station_name | object |
| 5 | start_station_latitude | float64 |
| 6 | start_station_longitude | float64 |
| 7 | end_station_id | float64 |
| 8 | end_station_name | object |
| 9 | end_station_latitude | float64 |
| 10 | end_station_longitude | float64 |
| 11 | bike_id | int64 |
| 12 | user_type | object |
| 13 | bike_share_for_all_trip | object |
| 14 | rental_access_method | object |
dtypes: float64(6), int64(2), object(7)
memory usage: 286.9+ MB
# more information about the dataset
df.describe().transpose()
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| duration_sec | 2506983.0 | 807.648330 | 1974.713981 | 60.000000 | 359.000000 | 571.000000 | 887.000000 | 912110.00 |
| start_station_id | 2426249.0 | 146.504749 | 122.317102 | 3.000000 | 47.000000 | 105.000000 | 243.000000 | 498.00 |
| start_station_latitude | 2506983.0 | 37.765064 | 0.136370 | 0.000000 | 37.769305 | 37.780526 | 37.795393 | 45.51 |
| start_station_longitude | 2506983.0 | -122.349919 | 0.308965 | -122.514299 | -122.413004 | -122.398285 | -122.291415 | 0.00 |
| end_station_id | 2424081.0 | 142.704424 | 121.429649 | 3.000000 | 43.000000 | 101.000000 | 239.000000 | 498.00 |
| end_station_latitude | 2506983.0 | 37.764219 | 0.239289 | 0.000000 | 37.770030 | 37.780760 | 37.795873 | 45.51 |
| end_station_longitude | 2506983.0 | -122.345908 | 0.708042 | -122.514287 | -122.411726 | -122.398113 | -122.293400 | 0.00 |
| bike_id | 2506983.0 | 27898.327162 | 114606.651187 | 4.000000 | 1952.000000 | 4420.000000 | 9682.000000 | 999941.00 |
# check Null/Nan Values
missing_values = df.isna().sum()
df.isna().sum()
duration_sec 0
start_time 0
end_time 0
start_station_id 80734
start_station_name 80133
start_station_latitude 0
start_station_longitude 0
end_station_id 82902
end_station_name 82350
end_station_latitude 0
end_station_longitude 0
bike_id 0
user_type 0
bike_share_for_all_trip 243259
rental_access_method 2386145
dtype: int64
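Raw counts are easier to judge as shares of the total. In the output above, `rental_access_method` is missing in 2386145 of 2506983 rows (about 95%), while `bike_share_for_all_trip` is missing in roughly 10%. A small sketch of how such percentages could be computed; the `sample` frame holds made-up values.

```python
import numpy as np
import pandas as pd

# Made-up sample with the same kind of gaps as the trip data.
sample = pd.DataFrame({
    "start_station_id": [229.0, np.nan, 245.0, 85.0],
    "user_type": ["Subscriber", "Customer", "Customer", "Subscriber"],
    "rental_access_method": [np.nan, np.nan, np.nan, "app"],
})

# isna().mean() is the fraction of missing values per column.
missing_pct = (sample.isna().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the columns that are candidates for dropping at the top.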
# Check for duplicates
df[df.duplicated()==True]
| duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | bike_share_for_all_trip | rental_access_method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
df.drop('rental_access_method', axis=1, inplace=True)
df.sample(5)
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | bike_share_for_all_trip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1881476 | 714 | 2019-09-09 07:25:39.0460 | 2019-09-09 07:37:33.2520 | 163.0 | Lake Merritt BART Station | 37.797320 | -122.265320 | 212.0 | Mosswood Park | 37.824893 | -122.260437 | 3385 | Subscriber | No |
| 1633394 | 2043 | 2019-08-13 12:19:34.5150 | 2019-08-13 12:53:37.9700 | 343.0 | Bryant St at 2nd St | 37.783172 | -122.393572 | 6.0 | The Embarcadero at Sansome St | 37.804770 | -122.403234 | 9513 | Subscriber | No |
| 773735 | 513 | 2019-04-10 18:38:32.5680 | 2019-04-10 18:47:06.3580 | 364.0 | China Basin St at 3rd St | 37.772000 | -122.389970 | 16.0 | Steuart St at Market St | 37.794130 | -122.394430 | 7013 | Subscriber | No |
| 182963 | 843 | 2019-01-03 08:15:31.0790 | 2019-01-03 08:29:34.9960 | 193.0 | Grand Ave at Santa Clara Ave | 37.812744 | -122.247215 | 182.0 | 19th Street BART Station | 37.809013 | -122.268247 | 890 | Subscriber | No |
| 1066543 | 518 | 2019-06-28 09:03:56.1590 | 2019-06-28 09:12:34.6810 | 24.0 | Spear St at Folsom St | 37.789677 | -122.390428 | 284.0 | Yerba Buena Center for the Arts (Howard St at ... | 37.784872 | -122.400876 | 2819 | Subscriber | No |
# rows with missing values are a small share of the data (about 10%), so I will drop them
df.dropna(inplace=True)
df.isna().sum()
duration_sec 0
start_time 0
end_time 0
start_station_id 0
start_station_name 0
start_station_latitude 0
start_station_longitude 0
end_station_id 0
end_station_name 0
end_station_latitude 0
end_station_longitude 0
bike_id 0
user_type 0
bike_share_for_all_trip 0
dtype: int64
df.sample(5)
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | bike_share_for_all_trip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 740368 | 685 | 2019-04-13 17:58:08.2260 | 2019-04-13 18:09:33.5790 | 239.0 | Bancroft Way at Telegraph Ave | 37.868813 | -122.258764 | 175.0 | 49th St at Telegraph Ave | 37.835946 | -122.262366 | 5450 | Subscriber | No |
| 177447 | 275 | 2019-01-04 01:33:21.4660 | 2019-01-04 01:37:56.7150 | 15.0 | San Francisco Ferry Building (Harry Bridges Pl... | 37.795392 | -122.394203 | 10.0 | Washington St at Kearny St | 37.795393 | -122.404770 | 4417 | Subscriber | Yes |
| 2353773 | 1137 | 2019-11-01 08:46:45.3140 | 2019-11-01 09:05:43.0750 | 76.0 | McCoppin St at Valencia St | 37.771662 | -122.422423 | 24.0 | Spear St at Folsom St | 37.789677 | -122.390428 | 13023 | Customer | No |
| 762469 | 214 | 2019-04-11 18:01:31.0220 | 2019-04-11 18:05:05.9090 | 104.0 | 4th St at 16th St | 37.767045 | -122.390833 | 81.0 | Berry St at 4th St | 37.775880 | -122.393170 | 6328 | Subscriber | No |
| 2014845 | 146 | 2019-10-21 16:29:25.1190 | 2019-10-21 16:31:51.8240 | 8.0 | The Embarcadero at Vallejo St | 37.799953 | -122.398525 | 16.0 | Steuart St at Market St | 37.794130 | -122.394430 | 9682 | Subscriber | No |
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2262979 entries, 0 to 2355470
Data columns (total 14 columns):
| # | Column | Dtype |
|---|---|---|
| 0 | duration_sec | int64 |
| 1 | start_time | object |
| 2 | end_time | object |
| 3 | start_station_id | float64 |
| 4 | start_station_name | object |
| 5 | start_station_latitude | float64 |
| 6 | start_station_longitude | float64 |
| 7 | end_station_id | float64 |
| 8 | end_station_name | object |
| 9 | end_station_latitude | float64 |
| 10 | end_station_longitude | float64 |
| 11 | bike_id | int64 |
| 12 | user_type | object |
| 13 | bike_share_for_all_trip | object |
dtypes: float64(6), int64(2), object(6)
memory usage: 259.0+ MB
# change to datetime format
df.start_time = pd.to_datetime(df.start_time)
df.end_time = pd.to_datetime(df.end_time)
# insert new columns
df.insert(0, 'duration_mins', (df.duration_sec/60))
df.insert(1, 'start_day', (df.start_time.dt.strftime('%a')))
df.insert(2, 'end_day', (df.end_time.dt.strftime('%a')))
df.insert(3, 'start_month', (df.start_time.dt.strftime('%b')))
df.insert(4, 'end_month', (df.end_time.dt.strftime('%b')))
df.insert(5, 'start_hour', (df.start_time.dt.hour))
df.insert(6, 'end_hour', (df.end_time.dt.hour))
df.head()
| | duration_mins | start_day | end_day | start_month | end_month | start_hour | end_hour | duration_sec | start_time | end_time | ... | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | bike_share_for_all_trip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1347.083333 | Thu | Fri | Jan | Feb | 17 | 16 | 80825 | 2019-01-31 17:57:44.613 | 2019-02-01 16:24:49.864 | ... | Foothill Blvd at 42nd Ave | 37.775745 | -122.213037 | 196.0 | Grand Ave at Perkins St | 37.808894 | -122.256460 | 4861 | Subscriber | No |
| 1 | 1098.333333 | Thu | Fri | Jan | Feb | 20 | 15 | 65900 | 2019-01-31 20:58:33.886 | 2019-02-01 15:16:54.173 | ... | Cyril Magnin St at Ellis St | 37.785881 | -122.408915 | 134.0 | Valencia St at 24th St | 37.752428 | -122.420628 | 5506 | Subscriber | No |
| 2 | 1043.883333 | Thu | Fri | Jan | Feb | 18 | 11 | 62633 | 2019-01-31 18:06:52.924 | 2019-02-01 11:30:46.530 | ... | Downtown Berkeley BART | 37.870139 | -122.268422 | 157.0 | 65th St at Hollis St | 37.846784 | -122.291376 | 2717 | Customer | No |
| 3 | 744.666667 | Thu | Fri | Jan | Feb | 19 | 8 | 44680 | 2019-01-31 19:46:09.719 | 2019-02-01 08:10:50.318 | ... | Church St at Duboce Ave | 37.770083 | -122.429156 | 53.0 | Grove St at Divisadero | 37.775946 | -122.437777 | 4557 | Customer | No |
| 4 | 1011.816667 | Thu | Fri | Jan | Feb | 14 | 7 | 60709 | 2019-01-31 14:19:01.541 | 2019-02-01 07:10:51.065 | ... | Steuart St at Market St | 37.794130 | -122.394430 | 28.0 | The Embarcadero at Bryant St | 37.787168 | -122.388098 | 2100 | Customer | No |
5 rows × 21 columns
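Day and month abbreviations produced by `strftime` are plain strings, so they sort alphabetically (`Fri`, `Mon`, `Sat`, ...) rather than in calendar order. One possible approach, sketched here on made-up timestamps, is to wrap them in an ordered categorical; the names `day_order` and `start_day` are illustrative only.

```python
import pandas as pd

day_order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Made-up timestamps for illustration.
start = pd.Series(pd.to_datetime([
    "2019-01-31 17:57:44", "2019-02-02 08:10:00", "2019-01-28 09:04:47",
]))

# strftime('%a') gives abbreviated day names, as in the cell above;
# an ordered Categorical makes sorting and plotting follow the week.
start_day = pd.Categorical(start.dt.strftime("%a"), categories=day_order, ordered=True)
counts = pd.Series(start_day).value_counts().sort_index()
print(counts)
```

Days with no trips still appear with a count of zero, which keeps plots aligned across user types.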
# calculate the distance in miles between start and end stations
df['distance_miles'] = 3958.756 * 2 * np.arcsin(np.sqrt(np.sin((np.radians(df['end_station_latitude']) - \
np.radians(df['start_station_latitude']))/2)**2 + np.cos(np.radians(df['start_station_latitude'])) * \
np.cos(np.radians(df['end_station_latitude'])) * np.sin((np.radians(df['end_station_longitude']) - \
np.radians(df['start_station_longitude']))/2)**2))
df.sample(5)
| | duration_mins | start_day | end_day | start_month | end_month | start_hour | end_hour | duration_sec | start_time | end_time | ... | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | bike_share_for_all_trip | distance_miles |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1590111 | 10.666667 | Mon | Mon | Aug | Aug | 18 | 18 | 640 | 2019-08-19 18:21:07.650 | 2019-08-19 18:31:48.373 | ... | 37.786928 | -122.398113 | 5.0 | Powell St BART Station (Market St at 5th St) | 37.783899 | -122.408445 | 2739 | Subscriber | No | 0.601738 |
| 1519803 | 10.250000 | Thu | Thu | Aug | Aug | 9 | 9 | 615 | 2019-08-29 09:10:36.692 | 2019-08-29 09:20:52.591 | ... | 37.776639 | -122.395526 | 58.0 | Market St at 10th St | 37.776619 | -122.417385 | 10004 | Subscriber | No | 1.193742 |
| 1948355 | 22.650000 | Tue | Tue | Oct | Oct | 22 | 22 | 1359 | 2019-10-29 22:21:26.406 | 2019-10-29 22:44:05.511 | ... | 37.772301 | -122.393028 | 373.0 | Potrero del Sol Park (25th St at Utah St) | 37.751792 | -122.405216 | 10958 | Customer | No | 1.565628 |
| 982704 | 5.316667 | Sun | Sun | May | May | 3 | 3 | 319 | 2019-05-12 03:42:19.440 | 2019-05-12 03:47:39.266 | ... | 37.755213 | -122.420975 | 107.0 | 17th St at Dolores St | 37.763015 | -122.426497 | 218 | Subscriber | No | 0.617745 |
| 2284853 | 9.366667 | Mon | Mon | Nov | Nov | 14 | 14 | 562 | 2019-11-11 14:37:51.553 | 2019-11-11 14:47:14.082 | ... | 37.337122 | -121.883215 | 304.0 | Jackson St at 5th St | 37.348759 | -121.894798 | 2028 | Subscriber | No | 1.025299 |
5 rows × 22 columns
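The distance expression above is the haversine (great-circle) formula written inline. A sketch of the same calculation as a reusable function may be easier to read and check; the function name `haversine_miles` is hypothetical, and the Earth radius is the value used above.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.756  # same radius as in the inline expression


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; accepts scalars or arrays in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine term: squared half-chord length between the two points.
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return EARTH_RADIUS_MILES * 2 * np.arcsin(np.sqrt(a))


# Coordinates copied from the sampled trip ending at Market St at 10th St.
d = haversine_miles(37.776639, -122.395526, 37.776619, -122.417385)
print(round(float(d), 3))
```

The result should agree closely with the `distance_miles` value shown for that row. Note that this is a straight-line distance between stations, not the route actually ridden.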
df.bike_id = df.bike_id.astype(str)
df.start_station_id = df.start_station_id.astype(str)
df.end_station_id = df.end_station_id.astype(str)
df.duration_mins = df.duration_mins.round()
df.duration_mins = df.duration_mins.astype(np.int64)
df.user_type = df.user_type.astype('category')
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2262979 entries, 0 to 2355470
Data columns (total 22 columns):
| # | Column | Dtype |
|---|---|---|
| 0 | duration_mins | int64 |
| 1 | start_day | object |
| 2 | end_day | object |
| 3 | start_month | object |
| 4 | end_month | object |
| 5 | start_hour | int64 |
| 6 | end_hour | int64 |
| 7 | duration_sec | int64 |
| 8 | start_time | datetime64[ns] |
| 9 | end_time | datetime64[ns] |
| 10 | start_station_id | object |
| 11 | start_station_name | object |
| 12 | start_station_latitude | float64 |
| 13 | start_station_longitude | float64 |
| 14 | end_station_id | object |
| 15 | end_station_name | object |
| 16 | end_station_latitude | float64 |
| 17 | end_station_longitude | float64 |
| 18 | bike_id | object |
| 19 | user_type | category |
| 20 | bike_share_for_all_trip | object |
| 21 | distance_miles | float64 |
dtypes: category(1), datetime64[ns](2), float64(5), int64(4), object(10)
memory usage: 382.0+ MB
# drop unwanted columns
def drop_cols(df, cols_list):
    """ Drop cols from dataframe
    """
    for col in cols_list:
        df.drop(col, axis=1, inplace=True)
drop_cols(df, ['start_time', 'end_time', 'duration_sec'])
df.sample(5)
| | duration_mins | start_day | end_day | start_month | end_month | start_hour | end_hour | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | bike_share_for_all_trip | distance_miles |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 815501 | 2 | Sat | Sat | Apr | Apr | 18 | 18 | 60.0 | 8th St at Ringold St | 37.774520 | -122.409449 | 350.0 | 8th St at Brannan St | 37.771431 | -122.405787 | 5380 | Subscriber | No | 0.292514 |
| 154018 | 17 | Wed | Wed | Jan | Jan | 6 | 6 | 144.0 | Precita Park | 37.747300 | -122.411403 | 363.0 | Salesforce Transit Center (Natoma St at 2nd St) | 37.787492 | -122.398285 | 4554 | Subscriber | No | 2.867976 |
| 1733141 | 24 | Fri | Fri | Sep | Sep | 17 | 17 | 258.0 | University Ave at Oxford St | 37.872355 | -122.266447 | 156.0 | Stanford Ave at Hollis St | 37.838443 | -122.288665 | 11991 | Subscriber | No | 2.638029 |
| 988909 | 5 | Fri | Fri | May | May | 17 | 17 | 169.0 | Bushrod Park | 37.846516 | -122.265304 | 241.0 | Ashby BART Station | 37.852477 | -122.270213 | 2849 | Subscriber | No | 0.491284 |
| 18146 | 12 | Wed | Wed | Jan | Jan | 7 | 7 | 349.0 | Howard St at Mary St | 37.781010 | -122.405666 | 81.0 | Berry St at 4th St | 37.775880 | -122.393170 | 2609 | Customer | No | 0.768969 |
Questions of interest:
Understand the number of subscribers and customers in the dataset and which of them are more valuable to the company.
The average time for which a bike is rented and the distance in miles travelled by different classes of users.
Understand whether the start time of rentals differs between subscribers and customers.
The locations that see the most rentals among different classes of users.
# change seaborn style
sns.set_style('darkgrid')
fig, ax = plt.subplots(nrows=2, figsize = [12, 6])
ax[0].hist(data = df, x='duration_mins', bins=np.arange(0, df.duration_mins.mean()+40, 1), color='b', alpha=0.8, edgecolor='k')
ax[1].hist(data = df, x = 'duration_mins', bins=10 ** np.arange(0, 2.0, 0.09), color='b', alpha=0.8, edgecolor='k')
plt.xscale('log')
plt.sca(ax[0])
plt.xticks(np.arange(0, 55, 1))
plt.title("Rental duration in minutes - Normal Scale", fontsize=16)
plt.ylim(0, 180000)
plt.xlabel("Rental duration in minutes", fontsize=16)
plt.ylabel("Count", fontsize=16)
plt.sca(ax[1])
plt.title("Rental duration in minutes - Log Scale", fontsize=16)
plt.xlabel("Rental duration in minutes - Log x scale", fontsize=16)
plt.ylabel("Count", fontsize=16)
plt.tight_layout()
plt.show()
From the normal-scale plot: the distribution of rental duration is right-skewed. Most rentals are short, with a long tail of rentals lasting up to about an hour or so.
From the log-scale plot: rental duration is roughly bimodal.
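The log-scale histogram relies on bin edges that are evenly spaced in log10 units rather than in minutes. The sketch below builds such edges and checks that consecutive edges differ by a constant ratio. The `durations` array is made up; only the bin construction mirrors the plot code.

```python
import numpy as np

# Made-up durations in minutes, for illustration only.
durations = np.array([2, 5, 6, 9, 10, 14, 15, 23, 60, 95])

# Edges equally spaced in log10 units: 10**0 = 1 up to 10**1.98 (about 95.5).
log_edges = 10 ** np.arange(0, 2.0, 0.09)
counts, edges = np.histogram(durations, bins=log_edges)

# On a log axis every bin has the same visual width because the ratio
# between consecutive edges is constant.
ratios = edges[1:] / edges[:-1]
print(len(log_edges), counts.sum(), ratios[0].round(3))
```

Because the last edge is about 95.5 minutes, any rental longer than that falls outside these bins and is not shown in the log-scale panel.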
# set the figure size before plotting so that it applies to this figure
sns.set(rc={'figure.figsize':(12, 6)})
sns.countplot(data = df, x = 'user_type', color='b')
plt.title("Number of Subscribers vs Customers", fontsize=16)
plt.xlabel("User Type", fontsize=16)
plt.ylabel("Count", fontsize=16)
plt.show()
# index by label rather than by position so the counts cannot be swapped
sub_count = df.user_type.value_counts()['Subscriber']
cust_count = df.user_type.value_counts()['Customer']
total = sub_count+cust_count
plt.pie([sub_count/total*100, cust_count/total*100], autopct='%1.1f%%', labels = ['Subscribers', 'Customers'], startangle=90)
plt.title('Percentage of Subscribers vs Customers', fontsize=16)
plt.axis('equal')
plt.show()
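The shares plotted in the pie chart can also be obtained directly with `value_counts(normalize=True)`, indexing by label so that the result does not depend on which user type happens to be listed first. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up user types for illustration.
user_type = pd.Series(
    ["Subscriber", "Subscriber", "Customer", "Subscriber"], dtype="category"
)

# normalize=True returns fractions that sum to 1; scale to percentages.
shares = user_type.value_counts(normalize=True) * 100
sub_pct = shares["Subscriber"]
cust_pct = shares["Customer"]
print(sub_pct, cust_pct)
```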
# days of week and months in the dataframe, in calendar order
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
months = [m for m in month_order if m in set(df.start_month)]
# Distribution of rentals across days of the week, months and hour
fig, ax = plt.subplots(nrows=3, figsize = [14, 8])
sns.countplot(data = df, x = 'start_hour', color = 'b', alpha=0.8, ax = ax[0])
plt.sca(ax[0])
plt.title("Number Of Rentals vs Hour", fontsize=16)
plt.xlabel("Start Hour", fontsize=16)
plt.ylabel("Count", fontsize=16)
sns.countplot(data = df, x = 'start_day', color = 'b', alpha=0.8, ax = ax[1], order=days)
plt.sca(ax[1])
plt.title("Number Of Rentals vs Day", fontsize=16)
plt.xlabel("Start Day", fontsize=16)
plt.ylabel("Count", fontsize=16)
sns.countplot(data = df, x = 'start_month', color = 'b', alpha=0.8, ax = ax[2], order=months)
plt.sca(ax[2])
plt.title("Number Of Rentals vs Month", fontsize=16)
plt.xlabel("Start Month", fontsize=16)
plt.ylabel("Count", fontsize=16)
plt.tight_layout()
plt.show()
# count subscribers in bike share for all scheme
sns.countplot(x=df[df.user_type == 'Subscriber'].bike_share_for_all_trip)
plt.title("Subscribers in Bike Share for All scheme", fontsize=16)
plt.show()
# percentage of subscribers in bike share for all scheme
bsfa_counts = df[df.user_type == 'Subscriber'].bike_share_for_all_trip.value_counts()
no_count = bsfa_counts['No']
yes_count = bsfa_counts['Yes']
plt.pie([no_count/(no_count+yes_count)*100, yes_count/(no_count+yes_count)*100], autopct='%1.1f%%', labels=['No', 'Yes'], startangle=90)
plt.title("Percentage of Subscribers in Bike Share for All scheme", fontsize=15)
plt.axis('equal')
plt.show()
# calc total distance of both subscribers and customers and their summation
sub_dist = df[df.user_type == 'Subscriber'].distance_miles.sum()
cust_dist = df[df.user_type == 'Customer'].distance_miles.sum()
tot_dist = df.distance_miles.sum()
# calc distance percentage of both subscribers and customers
plt.pie([sub_dist/tot_dist*100, cust_dist/tot_dist*100], autopct='%1.1f%%', labels=['Subscribers', 'Customers'], startangle=90)
plt.title("Distance travelled", fontsize=16)
plt.axis('equal')
plt.show()
# calc total duration of both subscribers and customers and their summation
sub_dur = df[df.user_type == 'Subscriber'].duration_mins.sum()
cust_dur = df[df.user_type == 'Customer'].duration_mins.sum()
tot_dur = df.duration_mins.sum()
# calc duration percentage of both subscribers and customers
plt.pie([sub_dur/tot_dur*100, cust_dur/tot_dur*100], autopct='%1.1f%%', labels=['Subscribers', 'Customers'], startangle=90)
plt.axis('equal')
plt.title('Rental duration of Subscribers and Customers', fontsize=15)
plt.show()
# duration in minutes traveled by subscribers
fig, ax = plt.subplots(nrows=2, figsize = [12, 5])
sub = df[df.user_type=='Subscriber']
ax[0].hist(sub.duration_mins, bins=np.arange(0, 40, 1), color='b', alpha=0.8, edgecolor='k')
ax[1].hist(sub.duration_mins, bins=10 ** np.arange(0, 2, 0.08), color='b', alpha=0.8, edgecolor='k')
plt.xscale('log')
plt.sca(ax[0])
plt.xticks(np.arange(0, 40, 1))
plt.title("Rental duration for subscribers in minutes - Normal Scale", fontsize=16)
plt.xlabel("Rental duration for subscribers in minutes", fontsize=16)
plt.ylabel("Count", fontsize=16)
plt.sca(ax[1])
plt.title("Rental duration for subscribers in minutes - Log Scale", fontsize=16)
plt.xlabel("Rental duration for subscribers in minutes", fontsize=16)
plt.ylabel("Count", fontsize=16)
plt.tight_layout()
plt.show()
# duration in minutes traveled by customers
fig, ax = plt.subplots(nrows=2, figsize = [12, 5])
sub = df[df.user_type=='Customer']
ax[0].hist(sub.duration_mins, bins=np.arange(0, 40, 1), color='b', alpha=0.8, edgecolor='k')
ax[1].hist(sub.duration_mins, bins=10 ** np.arange(0, 2, 0.08), color='b', alpha=0.8, edgecolor='k')
plt.xscale('log')
plt.sca(ax[0])
plt.xticks(np.arange(0, 40, 1))
plt.yticks(np.arange(0, 18000, 4000))
plt.title("Rental duration for customers in minutes - Normal Scale", fontsize=16)
plt.xlabel("Rental duration for customers in minutes", fontsize=16)
plt.ylabel("Count", fontsize=16)
plt.sca(ax[1])
plt.title("Rental duration for customers in minutes - Log Scale", fontsize=16)
plt.xlabel("Rental duration for customers in minutes", fontsize=16)
plt.ylabel("Count", fontsize=16)
plt.tight_layout()
plt.show()
# duration in minutes of rentals in the Bike Share for All program
fig, ax = plt.subplots(nrows=2, figsize = [12, 5])
sub = df[df.bike_share_for_all_trip=='Yes']
ax[0].hist(sub.duration_mins, bins=np.arange(0, 40, 1), color='b', alpha=0.8, edgecolor='k')
plt.sca(ax[0])
plt.xticks(np.arange(0, 40, 1))
plt.title("Rental duration for Bike Share for All users in minutes - Normal Scale", fontsize=16)
plt.xlabel("Rental duration for Bike Share for All users in minutes", fontsize=16)
plt.ylabel("Count", fontsize=16)
ax[1].hist(sub.duration_mins, bins=10 ** np.arange(0, 2, 0.08), color='b', alpha=0.8, edgecolor='k')
plt.sca(ax[1])
plt.xscale('log')
plt.title("Rental duration for Bike Share for All users in minutes - Log Scale", fontsize=16)
plt.xlabel("Rental duration for Bike Share for All users in minutes", fontsize=16)
plt.ylabel("Count", fontsize=16)
plt.tight_layout()
plt.show()
# distance in miles traveled by all types of users
fig, ax = plt.subplots(nrows=3, figsize = [16, 10])
sub = df[df.user_type=='Subscriber']
bins = np.arange(0, 3.5, 0.05)
ax[0].hist(sub.distance_miles, bins=bins, color='b', alpha=0.8, edgecolor='k')
plt.sca(ax[0])
plt.xticks(np.arange(0, 4, 0.1))
plt.title("Subscribers", fontsize=16)
plt.xlabel("Distance in miles", fontsize=16)
plt.ylabel("Count", fontsize=16)
cust = df[df.user_type=='Customer']
ax[1].hist(cust.distance_miles, bins=bins, color='b', alpha=0.8, edgecolor='k')
plt.sca(ax[1])
plt.xticks(np.arange(0, 4, 0.1))
plt.title("Customers", fontsize=16)
plt.xlabel("Distance in miles", fontsize=16)
plt.ylabel("Count", fontsize=16)
ax[2].hist(df[df.bike_share_for_all_trip == 'Yes'].distance_miles, bins=bins, color='b', alpha=0.8, edgecolor='k')
plt.sca(ax[2])
plt.xticks(np.arange(0, 4, 0.1))
plt.title("Bike Share for All", fontsize=16)
plt.xlabel("Distance in miles", fontsize=16)
plt.ylabel("Count", fontsize=16)
plt.tight_layout()
plt.show()
data = df[df.duration_mins < 20]
ax = sns.catplot(data=data, y='duration_mins', col="user_type", kind='box')
ax.fig.suptitle("Duration in minutes of Rentals for Subscribers and Customers", y=1.05, fontsize=16)
plt.show();
ax = sns.catplot(data=data, y='duration_mins', col="user_type", kind='violin')
ax.fig.suptitle("Duration in minutes of Rentals for Subscribers and Customers", y=1.05, fontsize=16)
plt.show();
## Which start stations see the most rentals by customers
sns.set(rc={'figure.figsize':(16, 10)})
a = df[df.user_type == 'Customer'].start_station_name.value_counts()[:40]
sns.barplot(x=a.values, y=a.index, palette='autumn')
plt.yticks(fontsize=16)
plt.title("Rentals Count vs Station for customers", fontsize=16)
plt.xlabel("Rentals Count", fontsize=16)
plt.ylabel("Station", fontsize=16)
plt.tight_layout()
plt.show()
## Which start stations see the most rentals by subscribers
sns.set(rc={'figure.figsize':(16, 10)})
a = df[df.user_type == 'Subscriber'].start_station_name.value_counts()[:40]
sns.barplot(x=a.values, y=a.index, palette='autumn')
plt.yticks(fontsize=16)
plt.title("Rentals Count vs Station for subscribers", fontsize=16)
plt.xlabel("Rentals Count", fontsize=16)
plt.ylabel("Station", fontsize=16)
plt.tight_layout()
plt.show()
## Which start stations see the most rentals by Bike Share for All users
sns.set(rc={'figure.figsize':(16, 10)})
a = df[df.bike_share_for_all_trip == "Yes"].start_station_name.value_counts()[:40]
sns.barplot(x=a.values, y=a.index, palette='autumn')
plt.yticks(fontsize=16)
plt.title("Rentals Count vs Station for bike share", fontsize=16)
plt.xlabel("Rentals Count", fontsize=16)
plt.ylabel("Station", fontsize=16)
plt.tight_layout()
plt.show()
# day and hour rental distribution for customers
a = sns.FacetGrid(df[df.user_type=='Customer'], col='start_day', col_wrap=4, col_order=days, sharey=True)
# bins run to 25 so that hour 23 gets its own bin
a = (a.map(plt.hist, 'start_hour', bins=np.arange(0, 25, 1), color='b', alpha=0.8, edgecolor='k')
      .set_axis_labels("Start Hour", "Count"))
a.set(xticks=np.arange(0, 24, 2))
plt.suptitle("Start Hour vs Day of the Week for Customers", y=1.1)
plt.show()
# heat map for number of rentals with hour and day of the week
a = df[df.user_type=='Subscriber'].groupby(['start_hour', 'start_day'])['bike_id'].size()
a = a.reset_index().pivot(index='start_hour', columns='start_day', values='bike_id')[days]
heat_map = sns.heatmap(a, cmap = 'viridis_r', annot=True, robust=True, fmt='d', linewidths=.5)
plt.title('Subscriber', y=1)
plt.xlabel('Weekday', labelpad = 18)
plt.ylabel('Start Time Hour', labelpad = 18)
plt.show()
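The hour-by-weekday table behind the heatmap can also be built with `pd.crosstab`, which counts rows directly and fills absent combinations with zero. This is an alternative sketch on made-up rows, not the method used above; with `pivot`, a missing hour and day combination would become `NaN`, which turns the table into floats and may not work with integer annotation formatting.

```python
import pandas as pd

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Made-up trips for illustration.
trips = pd.DataFrame({
    "start_hour": [8, 8, 17, 8, 12],
    "start_day": ["Mon", "Mon", "Mon", "Tue", "Sat"],
})

# Count trips per (hour, day); reindex to force all seven days in week order.
table = (
    pd.crosstab(trips["start_hour"], trips["start_day"])
    .reindex(columns=days, fill_value=0)
)
print(table)
```

The resulting integer table could then be passed to `sns.heatmap` in place of the pivoted frame.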
# new dataframe with the only needed columns for location
# take an explicit copy so the column assignments below do not modify a slice of df
df2 = df[['start_station_name', 'start_station_latitude', 'start_station_longitude', 'end_station_name',
          'end_station_latitude', 'end_station_longitude', 'user_type', 'bike_share_for_all_trip']].copy()
df2.start_station_latitude, df2.start_station_longitude = df2.start_station_latitude.astype(str), df2.start_station_longitude.astype(str)
df2.end_station_latitude, df2.end_station_longitude = df2.end_station_latitude.astype(str), df2.end_station_longitude.astype(str)
from IPython.display import Image
base_html='maps_html/'
base_imgs='maps_imgs/'
fm = folium.Map(location=[37.8272, -122.2913], tiles='Stamen Terrain', zoom_start=10)
c = df2[df2.user_type == 'Customer'].start_station_latitude.value_counts()[:40].index
test = df2[df2.user_type == 'Customer']
d = test.start_station_longitude.value_counts()[:40].index
def add_bikes_to_map(mp, stations_latitudes, stations_longitude):
    """
    Add red bike markers to the map at the given station latitudes and longitudes
    """
    for a, b in zip(stations_latitudes, stations_longitude):
        # use the station name at this latitude as the popup text
        pop = df2[df2.start_station_latitude==a].start_station_name.unique()[0]
        icon=folium.Icon(color='black', icon='bicycle', icon_color="red", prefix='fa', icon_size=(0, 0))
        folium.Marker(location = [float(a), float(b)], popup=pop, tooltip='click here', icon=icon).add_to(mp)
add_bikes_to_map(fm, c, d)
map_cust_file = 'map_customers.html'
map_cust_img = 'map_customers.png'
fm.save(base_html+map_cust_file)
Image(filename=base_imgs+map_cust_img)
fm = folium.Map(location=[37.8272, -122.2913], tiles='Stamen Terrain', zoom_start=10)
c = df2[df2.user_type == 'Subscriber'].start_station_latitude.value_counts()[:40].index
test = df2[df2.user_type == 'Subscriber']
d = test.start_station_longitude.value_counts()[:40].index
add_bikes_to_map(fm, c, d)
map_subs_file = 'map_subscribers.html'
map_subs_img = 'map_subscribers.png'
fm.save(base_html+map_subs_file)
Image(filename=base_imgs+map_subs_img)
fm = folium.Map(location=[37.8272, -122.2913], tiles='Stamen Terrain', zoom_start=10)
c = df2[df2.bike_share_for_all_trip=='Yes'].start_station_latitude.value_counts()[:40].index
d = df2[df2.bike_share_for_all_trip=='Yes'].start_station_longitude.value_counts()[:40].index
add_bikes_to_map(fm, c, d)
map_bikeshare_file = 'map_bikeshare.html'
map_bikeshare_img = 'map_bikeshare.png'
fm.save(base_html+map_bikeshare_file)
Image(filename=base_imgs+map_bikeshare_img)
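The maps above take the 40 most frequent latitudes and the 40 most frequent longitudes separately and then pair them up. That pairing is only correct if both rankings come out in the same order, which ties or stations sharing a coordinate could break. A hedged alternative is to rank stations by name and carry each station's own coordinates along; the sketch below uses made-up stations and hypothetical column aliases (`rentals`, `lat`, `lon`).

```python
import pandas as pd

# Made-up trips for illustration.
trips = pd.DataFrame({
    "start_station_name": ["A", "A", "B", "C", "C", "C"],
    "start_station_latitude": [37.79, 37.79, 37.77, 37.80, 37.80, 37.80],
    "start_station_longitude": [-122.39, -122.39, -122.42, -122.27, -122.27, -122.27],
})

# One row per station: rental count plus that station's own coordinates.
top = (
    trips.groupby("start_station_name")
    .agg(
        rentals=("start_station_latitude", "size"),
        lat=("start_station_latitude", "first"),
        lon=("start_station_longitude", "first"),
    )
    .nlargest(2, "rentals")
)
print(top)
```

Each row of `top` then holds a matching name, latitude and longitude, so markers could be added by looping over `top.itertuples()`.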
Subscribers rented bikes for less time than customers but travelled more, which suggests they are in a hurry.
The relationship between start day, start hour and user type gives a hint about how each user type uses the service.
Subscribers use bikes as a means of transportation.
Subscribers are regular users who ride to and from work or school, renting a bike at 7-9am and 4-6pm on weekdays.
Customers could be tourists who rent bikes to explore the Bay Area, mainly on weekends.
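Comparisons like these between user types can be backed by a small per-group summary in addition to the plots. The sketch below shows one way to compute median duration and distance per user type. The numbers are made up for illustration and are not results from this dataset.

```python
import pandas as pd

# Made-up trips for illustration only.
trips = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Customer", "Customer"],
    "duration_mins": [8, 10, 20, 30],
    "distance_miles": [1.0, 1.2, 0.9, 1.1],
})

# Medians are less sensitive than means to the very long rentals in the tail.
summary = trips.groupby("user_type")[["duration_mins", "distance_miles"]].median()
print(summary)
```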